{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true,
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# クイックスタート: ReazonSpeech v2.0で音声認識する"
      ],
      "metadata": {
        "id": "gOHDZ45jDwaG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "このチュートリアルでは、ReazonSpeech v2.0の深層学習モデルを使って音声認識を行います。\n",
        "\n",
        "（Colabのメニューから「ランタイム > ランタイムタイプの変更」でGPUを選択ください）"
      ],
      "metadata": {
        "id": "39F7gDDxEEIZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 最初のセットアップ\n",
        "\n",
        "最初に、ReazonSpeechをインストールして、環境をセットアップします。"
      ],
      "metadata": {
        "id": "Ub-eqNS9GM-Q"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6RDMBcSnAvAW"
      },
      "outputs": [],
      "source": [
        "!apt-get install libsndfile1 ffmpeg\n",
        "!git clone https://github.com/reazon-research/reazonspeech\n",
        "!pip install --no-warn-conflicts reazonspeech/pkg/nemo-asr"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## サンプル音声をダウンロード\n",
        "\n",
        "次のステップで使う音声ファイルを取得します。"
      ],
      "metadata": {
        "id": "tqu-h6PhGTQ0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!curl -O https://research.reazon.jp/_static/demo.mp3\n",
        "\n",
        "from IPython.display import Audio, display\n",
        "display(Audio(\"demo.mp3\"))"
      ],
      "metadata": {
        "id": "QhDP0IBrGbJL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 実際に音声認識を実行する\n",
        "\n",
        "ReazonSpeechのモデルをロードして、実際に音声を文字起こししてみましょう。"
      ],
      "metadata": {
        "id": "Q6wDpzqHELYE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from reazonspeech.nemo.asr import transcribe, audio_from_path, load_model\n",
        "\n",
        "# ReazonSpeechモデルをHugging Faceから取得\n",
        "model = load_model()"
      ],
      "metadata": {
        "id": "r3eVjQumENnd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# 音声にモデルを適用する\n",
        "audio = audio_from_path(\"demo.mp3\")\n",
        "ret = transcribe(model, audio)\n",
        "\n",
        "# 結果を出力\n",
        "print(\"\\n## Result:\")\n",
        "print(ret.text)"
      ],
      "metadata": {
        "id": "PIMeJ25QlT_D"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "結果が出力されたら成功です！"
      ],
      "metadata": {
        "id": "w-y_mhZMEXB3"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## もっと認識結果を活用する\n",
        "\n",
        "ReazonSpeechを使うと、発話セグメント単位の時刻情報を取得できます。"
      ],
      "metadata": {
        "id": "fIwe-YGoLh7B"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for seg in ret.segments:\n",
        "  print(\"%5.2f %5.2f %s\" % (seg.start_seconds, seg.end_seconds, seg.text))"
      ],
      "metadata": {
        "id": "7oHyyHBJLp4s"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "さらに、もっとも小さいサブワード単位のタイミングも取得できます。"
      ],
      "metadata": {
        "id": "DglzhjwCMefd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for word in ret.subwords[1:10]:\n",
        "  print(\"%5.2f %s\" % (word.seconds, word.token))"
      ],
      "metadata": {
        "id": "X5wpl7-YMk18"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}